Not on Colab: skipping Drive mount.
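The message above suggests the notebook checks for a Google Colab runtime before mounting Drive. A minimal sketch of such a guard follows, assuming the check relies on `google.colab` already being loaded by the kernel. The names `in_colab` and `mount_drive_if_colab` and the mount point are hypothetical, not taken from the notebook.

```python
import sys


def in_colab() -> bool:
    """Return True when google.colab is already loaded (Colab kernels preload it)."""
    return "google.colab" in sys.modules


def mount_drive_if_colab(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive on Colab; otherwise print a notice and skip."""
    if not in_colab():
        print("Not on Colab: skipping Drive mount.")
        return False
    from google.colab import drive  # only importable inside Colab

    drive.mount(mount_point)
    return True


mounted = mount_drive_if_colab()
```

Outside Colab the function only prints the notice, so the same notebook can run locally without changes.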

Load Libraries¶

dataset_modified: (39235, 62)

Load data¶

✓ Loaded data/processed/clean.csv -> (39235, 62)
dtypes (first 8):
url                         string[python]
timedelta                            Int64
n_tokens_title                     Float64
n_tokens_content                   Float64
n_unique_tokens                    Float64
n_non_stop_words                   Float64
n_non_stop_unique_tokens           Float64
num_hrefs                          Float64
dtype: object
url timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs ... max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares mixed_type_col
0 http://mashable.com/2013/01/07/amazon-instant-... 731 12.0 219.0 0.663594 1.0 0.815385 4.0 2.0 1.0 ... 0.7 -0.35 -0.6 -0.1 0.5 -0.1875 0.0 0.1875 593.0 493
1 http://mashable.com/2013/01/07/ap-samsung-spon... 731 9.0 255.0 0.604743 1.0 0.791946 3.0 1.0 1.0 ... 0.7 -0.11875 -0.125 -0.1 0.0 0.0 0.5 0.0 711.0 639
2 http://mashable.com/2013/01/07/apple-40-billio... 731 9.0 211.0 0.57513 1.0 0.663866 3.0 1.0 1.0 ... 1.0 -0.466667 -0.8 -0.133333 0.0 0.0 0.5 0.0 1500.0 493
3 http://mashable.com/2013/01/07/astronaut-notre... 731 9.0 531.0 0.503788 1.0 0.665635 9.0 0.0 1.0 ... 0.8 -0.369697 -0.6 -0.166667 0.0 0.0 0.5 0.0 1200.0 688
4 http://mashable.com/2013/01/07/att-u-verse-apps/ 731 13.0 1072.0 0.415646 1.0 0.54089 19.0 12.842425 20.0 ... 1.0 -0.220192 -0.5 -0.05 0.454545 0.136364 0.045455 0.136364 505.0 579
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
39230 http://mashable.com/2014/12/27/samsung-app-aut... 8 16.575486 346.0 0.529052 1.0 0.684783 9.0 7.0 1.0 ... 0.75 -0.26 -0.5 -0.125 0.1 0.0 0.4 0.0 1800.0 253
39231 http://mashable.com/2014/12/27/seth-rogen-jame... 8 12.0 328.0 0.696296 1.0 0.885057 9.0 7.0 3.0 ... 0.7 -0.211111 -0.4 -0.1 0.3 1.0 0.2 1.0 1900.0 493
39232 http://mashable.com/2014/12/27/son-pays-off-mo... 8 10.0 442.0 0.516355 1.0 0.644128 24.0 1.0 12.0 ... 0.5 -0.356439 -0.8 -0.166667 0.454545 0.136364 0.045455 0.136364 1900.0 555
39233 http://mashable.com/2014/12/27/ukraine-blasts/ 8 6.0 682.0 0.539493 1.0 0.692661 10.0 1.0 1.0 ... 0.5 -0.253332 -0.5 -0.0125 0.0 0.0 0.5 0.0 1100.0 493
39234 http://mashable.com/2014/12/27/youtube-channel... 8 10.0 1822.919305 0.701987 1.0 0.846154 1.0 1.0 0.0 ... 0.5 -0.2 -0.2 -0.2 0.333333 0.25 0.166667 0.25 1300.0 703

39235 rows × 62 columns

url                             string[python]
timedelta                                Int64
n_tokens_title                         Float64
n_tokens_content                       Float64
n_unique_tokens                        Float64
                                     ...      
title_sentiment_polarity               Float64
abs_title_subjectivity                 Float64
abs_title_sentiment_polarity           Float64
shares                                 Float64
mixed_type_col                           Int64
Length: 62, dtype: object
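The `string[python]`, `Int64` and `Float64` dtypes listed above are pandas nullable dtypes. One way to obtain them, sketched here as an assumption about the loading step, is to call `convert_dtypes()` after `read_csv`. The helper name `load_clean_csv` and the two sample rows are made up for illustration.

```python
import io

import pandas as pd


def load_clean_csv(source) -> pd.DataFrame:
    """Read a CSV and convert its columns to pandas nullable dtypes."""
    df = pd.read_csv(source)
    return df.convert_dtypes()


# Tiny in-memory stand-in for data/processed/clean.csv (made-up values).
sample = io.StringIO(
    "url,timedelta,n_tokens_title\n"
    "http://example.com/a,731,12.5\n"
    "http://example.com/b,8,9.25\n"
)
df = load_clean_csv(sample)
print(f"Loaded sample -> {df.shape}")
print(df.dtypes)
```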

Step 1 EDA - Clean Dataframe and describe columns¶

Classes and functions to clean columns and insert into pipeline¶

Define column type¶

(1, 14, 47)
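The tuple above sums to 62, the number of columns, so it plausibly counts text, binary and other numeric columns. The notebook's own classifier is not shown; the following is a minimal sketch under that assumption, with `split_column_types` as a hypothetical name and a toy frame with made-up rows.

```python
import pandas as pd


def split_column_types(df: pd.DataFrame):
    """Return (text, binary, numeric) lists of column names."""
    text_cols, binary_cols, numeric_cols = [], [], []
    for col in df.columns:
        series = df[col]
        if not pd.api.types.is_numeric_dtype(series):
            text_cols.append(col)      # e.g. the url column
        elif set(series.dropna().unique()) <= {0, 1}:
            binary_cols.append(col)    # 0/1 indicator flags
        else:
            numeric_cols.append(col)   # everything else numeric
    return text_cols, binary_cols, numeric_cols


toy = pd.DataFrame(
    {
        "url": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
        "is_weekend": [0, 1, 0],
        "shares": [593.0, 711.0, 1500.0],
    }
)
text_cols, binary_cols, numeric_cols = split_column_types(toy)
print((len(text_cols), len(binary_cols), len(numeric_cols)))
```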

Classify numeric columns¶

(18, 2, 3)

Preprocess columns¶

Original shape : (39235, 62)
Cleaned shape  : (39235, 62)
Any NA in url? : False
Duplicate urls : False
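The two checks reported above (missing and duplicated `url` values) can be sketched as follows. The function name `url_quality_report` and the toy rows are illustrative assumptions.

```python
import pandas as pd


def url_quality_report(df: pd.DataFrame, key: str = "url") -> dict:
    """Summarise shape, missing keys and duplicated keys for a dataframe."""
    return {
        "shape": df.shape,
        "any_na": bool(df[key].isna().any()),
        "any_duplicates": bool(df[key].duplicated().any()),
    }


toy = pd.DataFrame(
    {"url": ["http://example.com/a", "http://example.com/b"], "shares": [593.0, 711.0]}
)
report = url_quality_report(toy)
print(f"Any NA in url? : {report['any_na']}")
print(f"Duplicate urls : {report['any_duplicates']}")
```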

Describe the columns¶

LDA_00                 float64
LDA_01                 float64
LDA_02                 float64
LDA_03                 float64
LDA_04                 float64
                        ...   
weekday_is_friday        int64
weekday_is_saturday      int64
weekday_is_sunday        int64
is_weekend               int64
url                     object
Length: 62, dtype: object
count unique top freq mean std min 25% 50% 75% max
LDA_00 39235.0 NaN NaN NaN 0.1844 0.261623 0.010289 0.025182 0.033402 0.240389 0.926994
LDA_01 39235.0 NaN NaN NaN 0.142881 0.220038 0.01029 0.025032 0.033348 0.156095 0.925947
LDA_02 39235.0 NaN NaN NaN 0.215967 0.28094 0.010005 0.028572 0.040007 0.327956 0.919999
LDA_03 39235.0 NaN NaN NaN 0.222839 0.293553 0.010838 0.028572 0.040001 0.366368 0.926534
LDA_04 39235.0 NaN NaN NaN 0.233914 0.28832 0.010679 0.02858 0.047619 0.396494 0.927191
... ... ... ... ... ... ... ... ... ... ... ...
weekday_is_friday 39235.0 NaN NaN NaN 0.140232 0.347232 0.0 0.0 0.0 0.0 1.0
weekday_is_saturday 39235.0 NaN NaN NaN 0.060176 0.237815 0.0 0.0 0.0 0.0 1.0
weekday_is_sunday 39235.0 NaN NaN NaN 0.06721 0.250389 0.0 0.0 0.0 0.0 1.0
is_weekend 39235.0 NaN NaN NaN 0.127386 0.333409 0.0 0.0 0.0 0.0 1.0
url 39235 39235 http://mashable.com/2013/01/07/amazon-instant-... 1 NaN NaN NaN NaN NaN NaN NaN

62 rows × 11 columns
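A summary with this layout (one row per column, `unique`/`top`/`freq` empty for numeric columns and the numeric statistics empty for `url`) is what `describe(include="all")` returns once transposed. A short sketch on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "LDA_00": [0.1, 0.2, 0.9],
        "is_weekend": [0, 1, 0],
        "url": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    }
)

# include="all" mixes numeric and non-numeric statistics; .T puts one column per row.
summary = toy.describe(include="all").T
print(summary)
```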

Step 2 EDA - graphs¶

Function to graph¶

/tmp/ipykernel_103080/1188193524.py:43: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(x=df[col],
[Figure: count plots produced by the graphing function]

ML model¶

Split dataframe for Train, Validation and Test and drop columns that are redundant¶

Add correlation analysis¶

[Figure: correlation plot]
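A correlation analysis of this kind can also be summarised numerically by listing the column pairs whose absolute Pearson correlation exceeds a threshold. The sketch below is one way to do that; the function name, the 0.8 threshold and the synthetic data are assumptions, not the notebook's values.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, |r|) for numeric column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            r = corr.iloc[i, j]
            if r >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda item: -item[2])


# Synthetic data: the second column is built from the first, the third is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "kw_max_min": base,
        "kw_avg_min": 2.0 * base + rng.normal(scale=0.1, size=200),
        "num_imgs": rng.normal(size=200),
    }
)
pairs = highly_correlated_pairs(toy)
print(pairs)
```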
The following columns were removed from X¶
  1. average_token_length since it is an average correlated with n_tokens_content
  2. kw_avg_min since it is an average correlated with kw_min_min and kw_max_min
  3. kw_avg_max since it is an average correlated with kw_max_min and kw_max_max
  4. self_reference_min_shares since this data is normally only available after a model has been deployed
  5. self_reference_max_shares since this data is normally only available after a model has been deployed
  6. self_reference_avg_sharess since this data is normally only available after a model has been deployed
  7. is_weekend since we already have columns for Saturday and Sunday
  8. weekday_is_sunday since it can be inferred when none of the other weekday columns is set
  9. avg_positive_polarity since it is an average correlated with min_positive_polarity and max_positive_polarity
  10. avg_negative_polarity since it is an average correlated with min_negative_polarity and max_negative_polarity
  11. url since it is a string column that can be used as the index
  12. shares since that is our output
  13. kw_avg_avg since it is correlated with kw_max_avg and kw_min_avg
  14. timedelta since it is a non-predictive column
Original dataframe (39235, 62)
Cleaned dataframe (39235, 62)
X_train (27464, 48)
X_val (5885, 48)
X_test (5886, 48)
y_train (27464,)
y_val (5885,)
y_test (5886,)
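The column removal and the three-way split reported above can be sketched with two calls to scikit-learn's `train_test_split`. The 70/15/15 proportions, the seed and the function name `make_splits` are assumptions; the frame is synthetic and only matches the original row count.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# The 14 columns listed above as removed from X.
DROP_COLS = [
    "average_token_length", "kw_avg_min", "kw_avg_max",
    "self_reference_min_shares", "self_reference_max_shares",
    "self_reference_avg_sharess", "is_weekend", "weekday_is_sunday",
    "avg_positive_polarity", "avg_negative_polarity", "url", "shares",
    "kw_avg_avg", "timedelta",
]


def make_splits(df: pd.DataFrame, target: str = "shares", seed: int = 42):
    """Drop the listed columns, then split into train / validation / test."""
    X = df.drop(columns=DROP_COLS, errors="ignore")
    y = df[target]
    # First split: 70% train, 30% held out.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    # Second split: the held-out part is halved into validation and test.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


n_rows = 39235  # same row count as the cleaned dataframe; values are synthetic
toy = pd.DataFrame(
    {
        "num_imgs": np.arange(n_rows),
        "timedelta": np.arange(n_rows),
        "shares": np.arange(n_rows, dtype=float),
    }
)
X_train, X_val, X_test, y_train, y_val, y_test = make_splits(toy)
print("X_train", X_train.shape, "X_val", X_val.shape, "X_test", X_test.shape)
```

With 39235 rows these proportions give 27464, 5885 and 5886 rows, which matches the shapes printed above, although the notebook may have used different code to get there.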

Pipeline to improve features distributions¶

48
Shape of the input data:
before applying the transformations: (27464, 48)
after applying the transformations: (27464, 48)
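The unchanged shape indicates that the pipeline transforms columns in place rather than adding or removing any. The notebook's actual transformers are not shown; the sketch below assumes a Yeo-Johnson `PowerTransformer` for skewed numeric columns and a passthrough for binary flags, on synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer


def build_preprocessor(numeric_cols, binary_cols) -> ColumnTransformer:
    """Reshape and standardise numeric columns; leave 0/1 flags untouched."""
    return ColumnTransformer(
        transformers=[
            ("num", PowerTransformer(method="yeo-johnson", standardize=True), numeric_cols),
            ("bin", "passthrough", binary_cols),
        ]
    )


# Synthetic right-skewed features plus one binary flag.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "num_hrefs": rng.lognormal(mean=2.0, sigma=1.0, size=500),
        "n_tokens_content": rng.lognormal(mean=6.0, sigma=0.8, size=500),
        "weekday_is_monday": rng.integers(0, 2, size=500),
    }
)
pre = build_preprocessor(["num_hrefs", "n_tokens_content"], ["weekday_is_monday"])
toy_t = pre.fit_transform(toy)
print("before:", toy.shape, "after:", toy_t.shape)
```

Fitting the preprocessor on the training split only, and reusing it on validation and test data, keeps information from the held-out rows out of the transformation.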

Histogram after pipeline¶

LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 abs_title_sentiment_polarity abs_title_subjectivity global_rate_negative_words global_rate_positive_words global_sentiment_polarity ... data_channel_is_bus data_channel_is_socmed data_channel_is_tech data_channel_is_world weekday_is_monday weekday_is_tuesday weekday_is_wednesday weekday_is_thursday weekday_is_friday weekday_is_saturday
38677 1.549072 -0.424829 0.846910 0.444943 -0.734875 1.383421 -0.772061 0.0 0.042486 1.293728 ... 1 0 0 0 0 0 0 1 0 0
5096 0.835936 -0.551436 -0.746486 1.556585 -0.809549 1.383421 -1.315197 0.0 -1.852629 -4.404387 ... 0 0 0 0 0 0 1 0 0 0
26446 -0.556042 -0.426287 1.657392 -0.650452 -0.738698 -0.883153 0.864344 0.0 -0.004052 0.752190 ... 0 0 0 1 0 1 0 0 0 0
5588 -0.419554 -0.241316 1.493192 -0.539757 0.723072 -0.466910 0.316645 0.0 0.042486 -0.169173 ... 0 0 0 1 0 0 0 0 1 0
16614 0.704288 -0.849119 -0.915578 1.593365 -0.958407 -0.883153 0.864344 0.0 0.389315 1.048159 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 48 columns

[Figure: histograms of the features after the pipeline]

Merge X_train and X_test¶

Shape of the input variables BEFORE the transformations: (33350, 48)
Shape of the input variables AFTER the transformations: (33350, 48)

Functions¶

ML models¶

Linear regression¶

Find best parameters while using column transformation
Fitting 15 folds for each of 8 candidates, totalling 120 fits
>> Linear_Regression
Best RMSE (CV): 3974.5696 using {'model__copy_X': True, 'model__fit_intercept': True, 'model__positive': False}
------------------------------------------------------------------------------------------
rmse : 3974.5696 – 4081.8782  (std avg 90.238)
mae  : 2340.2179 – 2381.9597  (std avg 32.343)
mape : 1.5913 – 1.6280  (std avg 0.325)
r2   : -0.0035 – 0.0486  (std avg 0.006)
------------------------------------------------------------------------------------------
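The log above reports 8 candidates over 15 folds with parameters prefixed by `model__`, which is consistent with a grid search over a pipeline whose final step is named `model`. The sketch below reproduces those counts on synthetic data. The `RepeatedKFold` layout (5 splits, 3 repeats), the `StandardScaler` step and the scoring string are assumptions about how the 15 folds and the RMSE were obtained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("prep", StandardScaler()), ("model", LinearRegression())])

# 2 x 2 x 2 = 8 candidates, matching the parameter names in the log.
param_grid = {
    "model__copy_X": [True, False],
    "model__fit_intercept": [True, False],
    "model__positive": [True, False],
}

# 5 splits repeated 3 times = 15 folds (assumed layout).
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=cv)

# Synthetic regression problem standing in for X_train / y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

search.fit(X, y)
best_rmse = -search.best_score_  # the scorer returns negative RMSE
print(f"Best RMSE (CV): {best_rmse:.4f} using {search.best_params_}")
```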

✅ Artifacts saved:
- models/Linear_Regression.joblib
- models/KNN.joblib
- reports/models/cv_results_summary.csv
- reports/models/metrics.json
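Saving artifacts like those listed above can be sketched with `joblib.dump` for the fitted model and a JSON file for the metrics. The helper name `save_artifacts`, the temporary root directory and the metrics dictionary layout are assumptions; only the relative paths mirror the list above.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LinearRegression


def save_artifacts(model, metrics: dict, root: Path, name: str) -> list:
    """Persist a fitted model and its metrics under root; return the written paths."""
    model_path = root / "models" / f"{name}.joblib"
    metrics_path = root / "reports" / "models" / "metrics.json"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    metrics_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, model_path)
    metrics_path.write_text(json.dumps(metrics, indent=2))
    return [model_path, metrics_path]


root = Path(tempfile.mkdtemp())  # throwaway directory for the sketch
model = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])
saved = save_artifacts(
    model, {"Linear_Regression": {"rmse_cv": 3974.5696}}, root, "Linear_Regression"
)
print("Artifacts saved:")
for path in saved:
    print("-", path.relative_to(root))
```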

k-Nearest Neighbors (kNN)¶

Find best parameters while using column transformation
Fitting 15 folds for each of 8 candidates, totalling 120 fits
>> K_neighbors_nearest
Best RMSE (CV): 4014.1910 using {'model__algorithm': 'auto', 'model__n_neighbors': 21, 'model__p': 1, 'model__weights': 'uniform'}
------------------------------------------------------------------------------------------
rmse : 4014.1910 – 4271.3955  (std avg 91.015)
mae  : 2274.1715 – 2450.0750  (std avg 40.211)
mape : 1.4403 – 1.5850  (std avg 0.330)
r2   : -0.0990 – 0.0295  (std avg 0.010)
------------------------------------------------------------------------------------------
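The four metrics in these reports can be computed with scikit-learn as sketched below. Whether the notebook uses these exact functions is an assumption. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction, so under that assumption a MAPE of 1.5 would mean 150%. The function name `regression_report` and the sample values are illustrative.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)


def regression_report(y_true, y_pred) -> dict:
    """Compute RMSE, MAE, MAPE (as a fraction) and R^2 for one set of predictions."""
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "mape": float(mean_absolute_percentage_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


report = regression_report([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
for name, value in report.items():
    print(f"{name:<5}: {value:.4f}")
```

A negative R² means the model's predictions are worse than always predicting the mean of the target.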

Decision tree¶

Find best parameters while using column transformation
Fitting 15 folds for each of 12 candidates, totalling 180 fits
>> Decision_tree
Best RMSE (CV): 23400.2251 using {'model__criterion': 'absolute_error', 'model__max_depth': 7, 'model__max_features': 'sqrt'}
------------------------------------------------------------------------------------------
rmse : 23400.2251 – 39427.2034  (std avg 5531.391)
mae  : 3138.4086 – 6596.0320  (std avg 385.893)
mape : 0.6508 – 3.4537  (std avg 0.298)
r2   : -2.5399 – -0.0223  (std avg 0.896)
------------------------------------------------------------------------------------------

Random Forest¶

XGBoost¶

Neural network (MLP)¶

Support vector machine (SVM)¶